In Python, most things can be done with "import amazing_package_already_done_by_someone_else as solution_to_all_my_problems" Packages can be installed with pip, or with the conda environment manager (recommended).
import pandas as pd #pandas gives us access to the dataframe object
import numpy as np #numpy is the basic mathematical package for vector operations
import seaborn as sns #seaborn gives us a convenient plotting API
import plotly.express as px #plotly gives us an easy way to create interactive graphs
There are many ways to read data for use in Python. The simplest is to use Pandas to read a csv file. By default we use comma deliminated, utf-8 encoding. Files generated on windows with Excel can have different encoding formats depending on the version, which might require a little of plyaing with the parameters to find the right setting...
raw_df = pd.read_csv("winemag-data_first150k.csv") #the file is saved in the same folder as the notebook, so no need to give a path
type(raw_df) #since we opened the data with Pandas, it's already a DataFrame object
Let's see what data we have inside the dataframe
raw_df.head()
head() is a function. I can get help on any function with help(). Having pandas' documentation open is a tab is also a good idea: https://pandas.pydata.org/docs/
help(pd.DataFrame.head)
It seems we have more than one country. Let's check what is there
raw_df["country"].unique()
For the moment, I'm only interested in French wines, so let's create a French wine DataFrame
france_df = raw_df[raw_df["country"]=="France"]
How many French wines do we have?
len(france_df)
# type your answer in that cell
Our data contains two numerical variables: "points" and "price". Let's draw histograms to vizualize them
france_df[['points',"price"]].hist(bins=20)
The points are all between 80 and 100, but we have some crazy high prices in that data. Let's look at the 10 most expensive wines
france_df.sort_values(by='price',ascending=False).head(10)
# type your answer in that cell
Looking at 10 most expensive (and the 10 cheapest), it seems there is a relationship between price and quality (points). Let's check with a simple regression
sns.lmplot(x="price",y="points",data=france_df) #lm means linear model. We could also use relplot or regplot
That graph doesn't look great. The relationship isn't linear. Maybe we can use the log of the price instead. We calculate log_price variable and then redo the plot with that variable
france_df["log_price"] = np.log(france_df["price"]) #creates a new variable called log_price
sns.lmplot(x="log_price",y="points",data=france_df)
All French wines are not the same, let's look at the differences in price and points by region. We can do a pivot table with groupby
france_df.groupby("province").agg({'price':['count','mean'],'points':['mean','max','min']})
# type your answer in that cell
Pivot tables are fine, but for a nicer looking way of looking at the data, a chart can be a better idea
sns.boxplot(y="province",x="points", data=france_df)
Beaujolas wines are very cheap, but some of them might be good. What are the best Beaujolais for less than 20$?
france_df[(france_df['province']=="Beaujolais")&(france_df['price']<=20)].sort_values(by="points", ascending=False).head(5)
# type your answer in that cell
So far we've only looked at the French data. Let's look at the relative quality of wines by country
world_df = raw_df
Let's create a relative_value variable so we can compare
world_df["relative_value"] = world_df["points"] / np.log(world_df["price"])
world_df["relative_value"] = world_df["relative_value"] * 100 / np.max(world_df["relative_value"]) # we rescale from 0 to 100 (max)
And now we can compare the relative value by country
sns.barplot(y="country",x="relative_value", data=world_df[world_df['country'].isin(['France',"Italy","Spain","US","Chile","Australia"])])
# type your answer in that cell
French wine have the worse value? There is a problem with the "points" score in that data set. Let's correct that
I'm going to define a relative scaling factor to adjust the points according to my tastes
scaling_factor = {
'France':1.1, #let's multiply all French wine scores by 1.1
'Italy':0.8,
'Spain':0.7,
'US':0.9,
'Australia':0.8,
'Chile':0.5 #yuck
}
That's a perfect occasion to use merge. Let's transform scaling_factor into a DataFrame and merge it with word_df
scaling_df = pd.DataFrame(
data=scaling_factor.items(), #items() extracts the key/values pairs from the dictionnary
columns=["country","scaling_factor"]
)
scaling_df
And now we can merge with word_df
country_df = world_df.merge(scaling_df, left_on="country", right_on="country", how="inner") # I'm doing an inner joint => we're dropping the countries for which I didn't define a scaling factor
country_df.head()
Now let's recalculate the relative value and replot
country_df["julien_value"] = country_df["points"] * country_df["scaling_factor"] / np.log(country_df["price"])
country_df["julien_value"] = country_df["julien_value"] * 100 / np.max(country_df["relative_value"]) # we rescale from 0 to 100 (max)
sns.barplot(y="country",x="julien_value", data=country_df)
That looks much better!
# type your answer in that cell
To finish, let's do something a little more complex. For the 5 most popular varieties, let's display an interactive graph that shows the price and points of the wines with the highest value
variety_short_list = country_df['variety'].value_counts()[0:4].index # taking the 5 most popular varieties
for variety in variety_short_list:
temp_df = country_df[country_df['variety']==variety].sort_values(by="julien_value", ascending=False).head(500)
fig = px.scatter(temp_df, size="julien_value",x="price",y="points",color="province",hover_name="winery",title=variety)
fig.show(renderer="notebook")
# type your answer in that cell
#export to HTML for sharing
!jupyter nbconvert --to html Basic_Python_For_DataScience.ipynb